
feat: support pretrained_to_huggingface functionality for CosyVoice3 RL training #1890

Open
Sakkana wants to merge 1 commit into FunAudioLLM:main from Sakkana:main

@Sakkana Sakkana commented May 18, 2026

Support pretrained_to_huggingface functionality for CosyVoice3 RL training


Summary

Support converting a pretrained PyTorch model into a Hugging Face model for RL training.

1. Token Design

| Item | CosyVoice2 | CosyVoice3 |
| --- | --- | --- |
| Base speech tokens | 6561 | 6561 |
| Extra control tokens | `<\|eos1\|>` `<\|eos2\|>` `<\|eos3\|>` `<\|sos\|>` `<\|task_id\|>` (+5) | 200 extended slots + `<\|sos\|>` `<\|eos\|>` `<\|task_id\|>` (+203) |
| total_speech_tokens | 6564 | 6761 |

CV3 folds control tokens into the speech token space and uses an alias map to redirect them. CV2 simply appends them after the vocab.

2. Special Token Vocabulary

CV3 introduces phoneme-level tokens absent in CV2:

  • English ARPAbet: [AA], [AE], [AH], [B], [CH] ...
  • Mandarin pinyin with tones: [ā], [ǎo], [iāng], [uán] ...
  • New system control token: <|endofsystem|>

3. lm_head Construction

| Item | CosyVoice2 | CosyVoice3 |
| --- | --- | --- |
| Bias | Yes, initialized to `-inf` | No (`bias=False`) |
| Weight injection | Absolute offset indexing | `slice(speech_start_idx, speech_end_idx)` |
| Alias token handling | None | Copies weights from source token into alias token rows |
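
A minimal, framework-free sketch of the CosyVoice3-style construction summarized above: no bias, speech weights written into one contiguous slice, and alias rows copied from their source rows. The names `build_lm_head` and `alias_to_source`, the toy sizes, and the use of plain Python lists in place of tensors are all assumptions for illustration; the conversion script in this PR may be organized differently.

```python
def build_lm_head(vocab_size, hidden, speech_start_idx, speech_weight, alias_to_source):
    """Return a bias-free lm_head weight matrix as a list of rows (hypothetical helper)."""
    # Every row starts at zero; there is no bias vector in this variant.
    head = [[0.0] * hidden for _ in range(vocab_size)]

    # Inject the speech decoder rows as one contiguous slice.
    speech_end_idx = speech_start_idx + len(speech_weight)
    if speech_end_idx > vocab_size:
        raise ValueError("speech slice does not fit into the vocabulary")
    head[speech_start_idx:speech_end_idx] = [list(row) for row in speech_weight]

    # Copy each source row into its alias row so both token IDs score identically.
    for alias_id, source_id in alias_to_source.items():
        head[alias_id] = list(head[source_id])
    return head


# Toy example: 10-token vocabulary, hidden size 2, 3 speech rows starting at index 4.
speech_weight = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
head = build_lm_head(
    vocab_size=10,
    hidden=2,
    speech_start_idx=4,
    speech_weight=speech_weight,
    alias_to_source={1: 5},  # hypothetical: token 1 is an alias of speech row 5
)
```

Copying the row rather than sharing it keeps the two rows independent after conversion, which is one possible reading of "copies weights"; whether the real implementation ties or copies them is not stated in the description.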

4. Input Embeddings

CV2 explicitly copies llm_embedding weights for <|sos|> and <|task_id|> into the input embedding table. CV3 drops llm_embedding entirely and handles everything through the alias mechanism.

5. EOS Token Configuration

CV2 registers three separate EOS token IDs:

eos_token_ids = [offset+6561, offset+6562, offset+6563]

CV3 uses both alias and real IDs as a dual fallback:

llm.generation_config.eos_token_id = [alias_eos_token_id, real_eos_token_id]
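
The two EOS schemes can be sketched as follows. This is only an illustration under stated assumptions: the helper names, the example `offset`, and the example alias and real IDs are hypothetical, and only the value 6561 and the list shapes come from the description above.

```python
SPEECH_TOKEN_SIZE = 6561  # base speech tokens, as listed in section 1


def cv2_eos_token_ids(offset):
    """CV2 style: three consecutive EOS IDs right after the base speech tokens."""
    return [offset + SPEECH_TOKEN_SIZE + i for i in range(3)]


def cv3_eos_token_ids(alias_eos_token_id, real_eos_token_id):
    """CV3 style: stop on either the alias ID or the real ID."""
    return [alias_eos_token_id, real_eos_token_id]


def is_finished(token_id, eos_token_ids):
    """Generation stops as soon as any registered EOS ID is sampled."""
    return token_id in eos_token_ids


# Hypothetical values, chosen only to make the example concrete.
cv2_ids = cv2_eos_token_ids(offset=1000)
cv3_ids = cv3_eos_token_ids(alias_eos_token_id=42, real_eos_token_id=9000)
```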

Test

GPU: 8 x B200 + Triton reward server (SenseVoice) + Verl + GRPO adv_estimator

- gt_text: 
Nathy is still leading by fifteen thousand! We need one gift to unlock our bonus mission. Who is saving us?, 
- hyp_text: 
nay is still leading by fifteen thousand we need one gift to unlock our bonus mission who is saving us, 
reward_val: 0.851114966376682

Note:

The WER calculation and reward function are currently custom implementations. For ASR, we use the original SenseVoice implementation from the repository, without text frontend normalization, for demonstration purposes only. Using Whisper can achieve better performance.
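
For readers who want a concrete picture of a WER-based reward, here is a minimal sketch. It is not the code from this PR: the normalization step, the function names, and the shaping function `1 - tanh(3 * WER)` are assumptions. With one substituted word out of 20 in the sample pair above, this particular shaping evaluates to roughly 0.8511, in line with the reported `reward_val`, but the description does not state the actual formula.

```python
import math
import re


def normalize(text):
    """Lowercase and strip punctuation, then split into words (assumed normalization)."""
    return re.sub(r"[^a-z0-9' ]+", " ", text.lower()).split()


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    prev = list(range(len(hyp) + 1))  # distances against an empty reference prefix
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1] / max(len(ref), 1)


def reward_from_wer(wer, scale=3.0):
    """Hypothetical shaping: reward 1.0 at WER 0, decaying smoothly toward 0."""
    return 1.0 - math.tanh(scale * wer)


gt_text = ("Nathy is still leading by fifteen thousand! We need one gift "
           "to unlock our bonus mission. Who is saving us?")
hyp_text = ("nay is still leading by fifteen thousand we need one gift "
            "to unlock our bonus mission who is saving us")
wer = word_error_rate(gt_text, hyp_text)
reward = reward_from_wer(wer)
```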
